D² Brier Score (d2_brier_score)#

The Brier score measures how close your predicted probabilities are to the true 0/1 outcomes. The D² Brier score turns that into an explained fraction (similar to \(R^2\)):

  • \(D^2 = 1\) means perfect probability predictions

  • \(D^2 = 0\) matches a constant “predict the base rate” baseline

  • \(D^2 < 0\) means worse than the baseline

This notebook covers:

  • intuition + plots (Plotly)

  • the math (LaTeX)

  • a from-scratch NumPy implementation

  • using the metric as an optimization objective (logistic regression trained by minimizing Brier loss)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from plotly.subplots import make_subplots

from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, r2_score
from sklearn.model_selection import train_test_split


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Problem setting#

We focus on binary classification.

  • True labels: \(y_i \in \{0, 1\}\)

  • Predicted probability for the positive class: \(p_i \in [0, 1]\) where

\[ p_i \approx \mathbb{P}(y_i = 1 \mid x_i). \]

The key is that we evaluate probabilities, not hard class labels.
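To see why this matters, here is a small simulation (a standalone sketch on made-up data): the same predictor scored twice, once reporting its honest probabilities and once reduced to thresholded 0/1 labels.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, size=10_000)   # hypothetical true P(y=1) per example
y = rng.binomial(1, p_true)

# Same underlying predictor, reported two ways
bs_prob = np.mean((p_true - y) ** 2)                         # honest probabilities
bs_hard = np.mean(((p_true > 0.5).astype(float) - y) ** 2)   # thresholded 0/1 labels

print(f"Brier with probabilities: {bs_prob:.3f}")
print(f"Brier with hard labels  : {bs_hard:.3f}")
```

The hard-label version scores noticeably worse even though both versions rank examples identically, so throwing away the probabilities costs Brier score.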

2) Brier score (the underlying loss)#

For binary outcomes, the Brier score is just mean squared error in probability space:

\[ \mathrm{BS}(y, p) = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)^2. \]

With non-negative sample weights \(w_i\):

\[ \mathrm{BS}_w(y, p) = \frac{\sum_{i=1}^n w_i (p_i - y_i)^2}{\sum_{i=1}^n w_i}. \]

Interpretation: each example contributes a squared error term.

  • Predict \(p_i \approx 1\) when \(y_i = 1\) and \(p_i \approx 0\) when \(y_i = 0\) → small loss

  • Be very confident and wrong (e.g. \(p_i \approx 1\) but \(y_i = 0\)) → loss close to 1

The Brier score is a proper scoring rule: in expectation, it is minimized by predicting the true probability.
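A quick way to see the "proper" property: fix a true positive probability \(q\), sweep a constant prediction \(p\), and locate the minimum of the expected loss \((1-q)\,p^2 + q\,(1-p)^2\) (a standalone sketch, with \(q = 0.3\) chosen arbitrarily).

```python
import numpy as np

q = 0.3                                  # true P(y = 1), chosen arbitrarily
p = np.linspace(0, 1, 1001)              # candidate constant predictions
# Expected loss E[(p - Y)^2] for Y ~ Bernoulli(q)
expected_loss = (1 - q) * p**2 + q * (1 - p) ** 2
p_best = p[np.argmin(expected_loss)]
print(p_best)   # the minimizer is the true probability q
```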

p = np.linspace(0, 1, 401)

fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=(p - 0.0) ** 2, mode="lines", name="y=0"))
fig.add_trace(go.Scatter(x=p, y=(p - 1.0) ** 2, mode="lines", name="y=1"))

fig.update_layout(
    title="Brier loss for a single example: (p - y)^2",
    xaxis_title="Predicted probability p",
    yaxis_title="Squared error",
)
fig.update_xaxes(range=[0, 1])
fig.update_yaxes(range=[0, 1])
fig.show()

3) D² Brier score (fraction of Brier explained)#

The D² Brier score compares your model to a simple reference predictor:

  • Reference probability: the empirical base rate

\[ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i. \]
  • Reference predictions: \(p_i^{\text{ref}} = \bar{y}\) for all \(i\)

The reference Brier score is the Brier score of this constant predictor:

\[ \mathrm{BS}_{\text{ref}} = \frac{1}{n}\sum_{i=1}^n (\bar{y} - y_i)^2. \]

For binary \(y \in \{0,1\}\), this simplifies to:

\[ \mathrm{BS}_{\text{ref}} = \bar{y}(1-\bar{y}). \]
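A quick numerical check of this simplification (a standalone sketch on random labels; it works because \(y_i^2 = y_i\) for binary labels):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000).astype(float)
y_bar = y.mean()

bs_ref = np.mean((y_bar - y) ** 2)    # Brier score of the constant predictor
print(bs_ref, y_bar * (1 - y_bar))    # the two values agree
```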

The D² Brier score is then:

\[ D^2_{\text{Brier}} = 1 - \frac{\mathrm{BS}(y, p)}{\mathrm{BS}_{\text{ref}}}. \]

How to read it

  • \(D^2=1\): perfect probabilities (Brier score is 0)

  • \(D^2=0\): no improvement over predicting the base rate

  • \(D^2<0\): worse than the base-rate predictor

Edge case: if all labels are the same, then \(\mathrm{BS}_{\text{ref}} = 0\) and \(D^2\) is undefined. A common convention (used by sklearn.metrics.r2_score with force_finite=True) is:

  • return 1.0 if your predictions are perfect

  • otherwise return 0.0
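To see the convention in action (a standalone sketch; this relies on scikit-learn's default `force_finite=True` behavior for a constant target):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.ones(4)                         # all labels identical -> reference Brier is 0
print(r2_score(y, np.ones(4)))         # perfect predictions -> 1.0
print(r2_score(y, np.full(4, 0.9)))    # imperfect predictions -> 0.0
```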

4) Connection to \(R^2\)#

For binary targets, the D² Brier score is exactly the coefficient of determination computed on \((y, p)\):

\[ D^2_{\text{Brier}} = 1 - \frac{\sum_i (y_i - p_i)^2}{\sum_i (y_i - \bar{y})^2}. \]

That is the same algebraic form as \(R^2\).

This is useful as a sanity check:

from sklearn.metrics import r2_score
d2_brier = r2_score(y_true, y_prob)

scikit-learn exposes brier_score_loss but doesn’t currently ship a d2_brier_score convenience function. You can compute it directly from Brier scores with the base-rate reference:

import numpy as np
from sklearn.metrics import brier_score_loss

p_ref = y_true.mean()
d2_brier = 1 - brier_score_loss(y_true, y_prob) / brier_score_loss(
    y_true,
    np.full_like(y_true, p_ref, dtype=float),
)

Either way you compute it, D² Brier is conceptually a metric for probabilistic classification, and it is usually discussed together with calibration diagnostics (reliability diagrams).

def brier_score_loss_np(y_true, y_prob, sample_weight=None):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    err2 = (y_prob - y_true) ** 2

    if sample_weight is None:
        return float(np.mean(err2))

    w = np.asarray(sample_weight, dtype=float)
    return float(np.average(err2, weights=w))


def d2_brier_score_np(y_true, y_prob, sample_weight=None):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    if sample_weight is None:
        p_ref = float(np.mean(y_true))
    else:
        p_ref = float(np.average(y_true, weights=sample_weight))

    y_ref = np.full_like(y_true, p_ref, dtype=float)

    bs = brier_score_loss_np(y_true, y_prob, sample_weight=sample_weight)
    bs_ref = brier_score_loss_np(y_true, y_ref, sample_weight=sample_weight)

    if np.isclose(bs_ref, 0.0):
        return 1.0 if np.isclose(bs, 0.0) else 0.0

    return 1.0 - bs / bs_ref


def plot_reliability_diagram(y_true, probas_by_name, n_bins=12, title="Reliability diagram"):
    fig = go.Figure()

    fig.add_trace(
        go.Scatter(
            x=[0, 1],
            y=[0, 1],
            mode="lines",
            name="Perfect calibration",
            line=dict(dash="dash"),
        )
    )

    for name, p in probas_by_name.items():
        prob_true, prob_pred = calibration_curve(
            y_true, p, n_bins=n_bins, strategy="uniform"
        )
        fig.add_trace(
            go.Scatter(
                x=prob_pred,
                y=prob_true,
                mode="lines+markers",
                name=name,
            )
        )

    fig.update_layout(
        title=title,
        xaxis_title="Mean predicted probability",
        yaxis_title="Fraction of positives",
        legend_title="Model",
    )
    fig.update_xaxes(range=[0, 1])
    fig.update_yaxes(range=[0, 1], scaleanchor="x", scaleratio=1)
    return fig
# Quick sanity checks
y = rng.integers(0, 2, size=2000)
p = rng.random(size=y.shape[0])

bs_np = brier_score_loss_np(y, p)
bs_sklearn = brier_score_loss(y, p)

d2_np = d2_brier_score_np(y, p)
d2_r2 = r2_score(y, p)

print(f"Brier  (NumPy)  : {bs_np:.6f}")
print(f"Brier  (sklearn): {bs_sklearn:.6f}")
print(f"D² Brier (NumPy): {d2_np:.6f}")
print(f"r2_score        : {d2_r2:.6f}")

assert np.allclose(bs_np, bs_sklearn)
assert np.allclose(d2_np, d2_r2)

# Baseline should give D² = 0
p_base = np.full_like(y, y.mean(), dtype=float)
assert np.allclose(d2_brier_score_np(y, p_base), 0.0)
Brier  (NumPy)  : 0.332926
Brier  (sklearn): 0.332926
D² Brier (NumPy): -0.331711
r2_score        : -0.331711

5) Intuition: calibrated vs. miscalibrated probabilities#

We’ll generate data from a known probability model so we can create different types of predictors:

  • Oracle: predicts the true probability (best possible Brier score)

  • Underconfident: probabilities are pushed toward 0.5

  • Overconfident: probabilities are pushed toward 0 or 1

  • Baseline: always predicts the base rate \(\bar{y}\)

Then we compare Brier and D² Brier and visualize calibration.

def sigmoid(z):
    z = np.asarray(z)
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


n = 6000
x = rng.normal(size=n)

p_true = sigmoid(1.8 * x - 0.2)
y = rng.binomial(1, p_true)

logit_true = np.log(p_true / (1 - p_true))

p_oracle = p_true
p_under = sigmoid(0.5 * logit_true)
p_over = sigmoid(2.0 * logit_true)
p_base = np.full_like(y, y.mean(), dtype=float)

models = {
    "baseline (ȳ)": p_base,
    "underconfident": p_under,
    "oracle": p_oracle,
    "overconfident": p_over,
}

names = list(models.keys())
briers = np.array([brier_score_loss_np(y, models[name]) for name in names])
d2s = np.array([d2_brier_score_np(y, models[name]) for name in names])

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Brier score (lower is better)", "D² Brier score (higher is better)"),
)

fig.add_trace(go.Bar(x=names, y=briers), row=1, col=1)
fig.add_trace(go.Bar(x=names, y=d2s), row=1, col=2)

fig.update_layout(title="Same data, different probability quality", showlegend=False)
fig.update_yaxes(title_text="Brier", row=1, col=1)
fig.update_yaxes(title_text="D²", row=1, col=2)
fig.show()
fig = plot_reliability_diagram(
    y_true=y,
    probas_by_name=models,
    title="Reliability diagram: calibration differences show up in Brier/D²",
)
fig.show()

6) Per-example contributions (what drives the score?)#

The Brier score is an average of \((p_i - y_i)^2\). Looking at the worst individual contributions helps you see why a model is doing poorly.

Below we visualize the biggest squared errors for the overconfident predictor.

contrib = (p_over - y) ** 2
idx = np.argsort(contrib)[::-1][:60]

colors = np.where(y[idx] == 1, "#d62728", "#1f77b4")
hover = [f"p={p_over[i]:.3f}<br>y={int(y[i])}<br>(p-y)^2={contrib[i]:.3f}" for i in idx]

fig = go.Figure(
    data=[
        go.Bar(
            x=np.arange(len(idx)),
            y=contrib[idx],
            marker_color=colors,
            hovertext=hover,
            hoverinfo="text",
        )
    ]
)

fig.update_layout(
    title="Largest Brier contributions (overconfident predictor)",
    xaxis_title="Worst examples (rank)",
    yaxis_title="(p - y)^2",
)
fig.show()

7) Using D² Brier for optimization (logistic regression)#

On a fixed dataset, the reference Brier score \(\mathrm{BS}_{\text{ref}}\) depends only on \(y\). So maximizing D² Brier is equivalent to minimizing the Brier score:

\[ D^2_{\text{Brier}} = 1 - \frac{\mathrm{BS}(y,p)}{\mathrm{BS}_{\text{ref}}} \quad \Rightarrow \quad \arg\max D^2_{\text{Brier}} = \arg\min \mathrm{BS}(y,p). \]

We’ll fit a simple logistic regression model:

\[ p(x) = \sigma(w^T x), \quad \sigma(t) = \frac{1}{1 + e^{-t}}. \]

Using the Brier score as the objective:

\[ \mathcal{L}(w) = \frac{1}{n} \sum_{i=1}^n (\sigma(w^T x_i) - y_i)^2. \]

The gradient is (vectorized):

\[ \nabla_w \mathcal{L} = \frac{2}{n} X^T\Big((p-y) \odot p \odot (1-p)\Big) \]

where \(p = \sigma(Xw)\) and \(\odot\) is elementwise multiplication.
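The gradient formula can be sanity-checked against central finite differences (a standalone sketch; the small random \(X\), \(y\), \(w\) here exist only for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)


def brier_objective(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.mean((p - y) ** 2)


# Analytic gradient: (2/n) X^T ((p - y) * p * (1 - p))
p = 1.0 / (1.0 + np.exp(-(X @ w)))
grad = (2.0 / len(y)) * X.T @ ((p - y) * p * (1.0 - p))

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array(
    [(brier_objective(w + eps * e) - brier_objective(w - eps * e)) / (2 * eps)
     for e in np.eye(3)]
)
print(np.max(np.abs(grad - grad_fd)))   # tiny discrepancy -> formula checks out
```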

We’ll also train the same model with log loss for comparison.

X, y = make_classification(
    n_samples=3500,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=1.2,
    flip_y=0.03,
    random_state=42,
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Standardize for nicer optimization
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-12

X_train = (X_train - mean) / std
X_val = (X_val - mean) / std


def add_intercept(X):
    return np.c_[np.ones((X.shape[0], 1)), X]


X_train_i = add_intercept(X_train)
X_val_i = add_intercept(X_val)

X_train_i.shape, X_val_i.shape
((2450, 3), (1050, 3))
def fit_logistic_gd(
    X_train,
    y_train,
    X_val,
    y_val,
    *,
    loss="brier",
    lr=0.5,
    n_epochs=800,
    l2=0.0,
    seed=0,
):
    rng_local = np.random.default_rng(seed)
    w = rng_local.normal(scale=0.1, size=X_train.shape[1])

    history = {
        "brier_train": np.empty(n_epochs),
        "d2_train": np.empty(n_epochs),
        "brier_val": np.empty(n_epochs),
        "d2_val": np.empty(n_epochs),
    }

    for epoch in range(n_epochs):
        # Forward
        p_train = sigmoid(X_train @ w)
        p_val = sigmoid(X_val @ w)

        # Metrics
        history["brier_train"][epoch] = brier_score_loss_np(y_train, p_train)
        history["d2_train"][epoch] = d2_brier_score_np(y_train, p_train)
        history["brier_val"][epoch] = brier_score_loss_np(y_val, p_val)
        history["d2_val"][epoch] = d2_brier_score_np(y_val, p_val)

        # Gradient
        if loss == "brier":
            grad = (2.0 / X_train.shape[0]) * X_train.T @ (
                (p_train - y_train) * p_train * (1.0 - p_train)
            )
        elif loss == "log":
            # Negative log-likelihood / cross-entropy
            grad = (1.0 / X_train.shape[0]) * X_train.T @ (p_train - y_train)
        else:
            raise ValueError("loss must be 'brier' or 'log'")

        # L2 regularization (skip intercept)
        if l2:
            reg = np.r_[0.0, w[1:]]
            grad = grad + l2 * reg

        # Update
        w = w - lr * grad

    return w, history


w_brier, hist_brier = fit_logistic_gd(
    X_train_i,
    y_train,
    X_val_i,
    y_val,
    loss="brier",
    lr=0.8,
    n_epochs=800,
    l2=1e-3,
    seed=1,
)

w_log, hist_log = fit_logistic_gd(
    X_train_i,
    y_train,
    X_val_i,
    y_val,
    loss="log",
    lr=0.3,
    n_epochs=800,
    l2=1e-3,
    seed=1,
)

w_brier, w_log
(array([0.5236, 3.7584, 1.0648]), array([0.5545, 4.6785, 1.1888]))
epochs = np.arange(len(hist_brier["brier_train"]))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Validation Brier score", "Validation D² Brier"),
)

fig.add_trace(
    go.Scatter(x=epochs, y=hist_brier["brier_val"], mode="lines", name="GD (Brier objective)"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=epochs, y=hist_log["brier_val"], mode="lines", name="GD (Log-loss objective)"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(x=epochs, y=hist_brier["d2_val"], mode="lines", name="GD (Brier objective)"),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(x=epochs, y=hist_log["d2_val"], mode="lines", name="GD (Log-loss objective)"),
    row=1,
    col=2,
)

fig.update_layout(title="Training dynamics (validation)")
fig.update_yaxes(title_text="Brier", row=1, col=1)
fig.update_yaxes(title_text="D²", row=1, col=2)
fig.show()
p_val_brier = sigmoid(X_val_i @ w_brier)
p_val_log = sigmoid(X_val_i @ w_log)
p_val_base = np.full_like(y_val, y_train.mean(), dtype=float)

print("Validation metrics")
print("-" * 60)

for name, p_val in {
    "baseline (train ȳ)": p_val_base,
    "logistic GD (Brier objective)": p_val_brier,
    "logistic GD (Log-loss objective)": p_val_log,
}.items():
    bs = brier_score_loss_np(y_val, p_val)
    d2 = d2_brier_score_np(y_val, p_val)
    print(f"{name:30s}  Brier={bs:.4f}  D²={d2:.4f}")

fig = plot_reliability_diagram(
    y_true=y_val,
    probas_by_name={
        "baseline": p_val_base,
        "GD (Brier objective)": p_val_brier,
        "GD (Log loss objective)": p_val_log,
    },
    title="Calibration on validation set",
)
fig.show()
Validation metrics
------------------------------------------------------------
baseline (train ȳ)              Brier=0.2500  D²=-0.0000
logistic GD (Brier objective)   Brier=0.0410  D²=0.8358
logistic GD (Log-loss objective)  Brier=0.0385  D²=0.8459

8) Pros, cons, and where to use it#

Pros

  • Evaluates probabilities directly (not just class labels)

  • Sensitive to both calibration (reliability) and discrimination (resolution)

  • Proper scoring rule: incentivizes honest probabilities

  • D² gives a normalized scale: “fraction of Brier explained” relative to the base-rate predictor

Cons / pitfalls

  • Needs well-defined probabilities; feeding hard labels (0/1) throws away information

  • If the dataset is extremely imbalanced, \(\bar{y}(1-\bar{y})\) can be very small → D² can become unstable

  • If all labels are identical, the score is undefined (denominator 0)

  • As a training objective with a sigmoid model, Brier loss has an extra \(p(1-p)\) factor in the gradient (can lead to smaller gradients when \(p\) saturates near 0/1 compared to log loss)
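The last point is easy to see numerically. For a single example with \(y = 0\), the per-example gradient with respect to the logit is \(2(p-y)\,p(1-p)\) under Brier loss versus \(p-y\) under log loss (a standalone sketch):

```python
import numpy as np

# Single example with y = 0; predictions grow confidently wrong
p = np.array([0.5, 0.9, 0.99, 0.999])

brier_grad = 2 * p * p * (1 - p)   # d/dlogit of (p - y)^2 with y = 0
log_grad = p                       # d/dlogit of log loss with y = 0

for pi, bg, lg in zip(p, brier_grad, log_grad):
    print(f"p={pi:.3f}  Brier grad={bg:.5f}  log-loss grad={lg:.3f}")
```

As \(p \to 1\) the Brier gradient vanishes while the log-loss gradient stays near 1, so log loss keeps pushing on confidently wrong examples while Brier loss nearly stops.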

Good use cases

  • Risk prediction / decision support where probability calibration matters (medicine, finance, fraud)

  • Comparing probabilistic models against a simple baseline

  • Monitoring probabilistic models over time (drift can show up as calibration degradation)

Diagnostics to pair with it

  • Reliability diagram (calibration curve)

  • Histogram of predicted probabilities (do you predict only around 0.5?)

  • If you also care about ranking/discrimination, add AUC/PR-AUC alongside Brier/D²

9) Exercises#

  1. Create a model that ranks perfectly but is badly calibrated (e.g. apply an aggressive temperature to logits). What happens to AUC vs. Brier/D²?

  2. Implement sample weights and verify that your weighted D² matches sklearn.metrics.r2_score(..., sample_weight=...).

  3. Try Platt scaling or isotonic regression to calibrate a model; measure the change in Brier and D².

References#

  • Brier score (Wikipedia): https://en.wikipedia.org/wiki/Brier_score

  • scikit-learn brier_score_loss: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html

  • scikit-learn r2_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html